Spotify provides language-agnostic “audio DNA”: 10 numerical
features describing every track’s sonic profile (e.g.,
acousticness, energy, danceability).
This project applies modern unsupervised learning to these features to
reveal structure and trends in the resulting sonic clusters.
Business use-case: Enable a PM to quickly auto-curate distinct playlists (e.g., “mellow pop”, “Latin EDM party”, “soft ballads”) based on measurable audio features alone — and explain why certain hits occupy their place in the sonic landscape.
Key deliverable:
A descriptive diagnostic that not only highlights clear, actionable
clusters, but also transparently communicates which groups are coherent
vs. candidates for refinement, with all outputs designed to feed into an
interactive Shiny app dashboard.
# Load necessary packages and read in df
library(psych); library(mclust)                       # scree(), principal(), VSS(); Mclust()
library(dplyr); library(ggplot2); library(tidytext); library(purrr)  # wrangling, plots, reorder_within(), set_names()
df <- read.csv(file.path(root, "data/spotifyfeatures.csv"), header = TRUE)
# Examine data
head(df)
## genre artist_name track_name
## 1 Movie Henri Salvador C'est beau de faire un Show
## 2 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh)
## 3 Movie Joseph Williams Don't Let Me Be Lonely Tonight
## 4 Movie Henri Salvador Dis-moi Monsieur Gordon Cooper
## 5 Movie Fabien Nataf Ouverture
## 6 Movie Henri Salvador Le petit souper aux chandelles
## track_id popularity acousticness danceability duration_ms
## 1 0BRjO6ga9RKCKjfDqeFgWV 0 0.611 0.389 99373
## 2 0BjC1NfoEOOusryehmNudP 1 0.246 0.590 137373
## 3 0CoSDzoNIKCRs124s9uTVy 3 0.952 0.663 170267
## 4 0Gc6TVm52BwZD07Ki6tIvf 0 0.703 0.240 152427
## 5 0IuslXpMROHdEPvSl1fTQK 4 0.950 0.331 82625
## 6 0Mf1jKa8eNAf1a4PwTbizj 0 0.749 0.578 160627
## energy instrumentalness key liveness loudness mode speechiness tempo
## 1 0.9100 0.000 C# 0.3460 -1.828 Major 0.0525 166.969
## 2 0.7370 0.000 F# 0.1510 -5.559 Minor 0.0868 174.003
## 3 0.1310 0.000 C 0.1030 -13.879 Minor 0.0362 99.488
## 4 0.3260 0.000 C# 0.0985 -12.178 Major 0.0395 171.758
## 5 0.2250 0.123 F 0.2020 -21.150 Major 0.0456 140.576
## 6 0.0948 0.000 C# 0.1070 -14.970 Major 0.1430 87.479
## time_signature valence
## 1 4/4 0.814
## 2 4/4 0.816
## 3 5/4 0.368
## 4 4/4 0.227
## 5 4/4 0.390
## 6 4/4 0.358
# Remove duplicated tracks that would otherwise skew the analysis
df <- df[!duplicated(df$track_id), ]
# Drop identifiers, categorical variables, and popularity (an outcome, not a sonic feature)
df_ready_for_pca <- df[, !names(df) %in% c('popularity','genre', 'artist_name', 'track_name', 'track_id', 'key', 'mode', 'time_signature')]
# Examine Data
head(df_ready_for_pca)
## acousticness danceability duration_ms energy instrumentalness liveness
## 1 0.611 0.389 99373 0.9100 0.000 0.3460
## 2 0.246 0.590 137373 0.7370 0.000 0.1510
## 3 0.952 0.663 170267 0.1310 0.000 0.1030
## 4 0.703 0.240 152427 0.3260 0.000 0.0985
## 5 0.950 0.331 82625 0.2250 0.123 0.2020
## 6 0.749 0.578 160627 0.0948 0.000 0.1070
## loudness speechiness tempo valence
## 1 -1.828 0.0525 166.969 0.814
## 2 -5.559 0.0868 174.003 0.816
## 3 -13.879 0.0362 99.488 0.368
## 4 -12.178 0.0395 171.758 0.227
## 5 -21.150 0.0456 140.576 0.390
## 6 -14.970 0.1430 87.479 0.358
Since the units are incommensurate, we standardize the dataset so that variables with large variances (such as duration_ms and tempo) do not dominate.
data_scaled <- scale(df_ready_for_pca)
First, I will examine the scree plot of the correlation matrix of the standardized data and use the elbow method to identify the number of components.
# Plot PCA results
scree(cor(data_scaled), factors=FALSE)
Applying the Kaiser criterion, three components have eigenvalues greater
than 1, suggesting a three-component solution. However, since the fourth
component has an eigenvalue close to 1, I will also investigate it to
see whether its loadings are interpretable. Visually, the elbow of the
scree plot also appears at three components.
# 1 factor:
fa_result <- factanal(data_scaled, factors = 1, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled, factors = 1, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy
## 0.403 0.761 0.997 0.185
## instrumentalness liveness loudness speechiness
## 0.754 0.988 0.170 0.996
## tempo valence
## 0.926 0.746
##
## Loadings:
## Factor1
## acousticness -0.773
## energy 0.903
## loudness 0.911
## valence 0.504
## danceability 0.489
## duration_ms
## instrumentalness -0.496
## liveness
## speechiness
## tempo
##
## Factor1
## SS loadings 3.074
## Proportion Var 0.307
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 229747.2 on 35 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled, nfactors=1, rotate="varimax")
print(pca_result$loadings, cutoff=0.4, digits=3, sort=TRUE)
##
## Loadings:
## PC1
## acousticness -0.777
## danceability 0.661
## energy 0.866
## instrumentalness -0.617
## loudness 0.883
## valence 0.669
## duration_ms
## liveness
## speechiness
## tempo
##
## PC1
## SS loadings 3.542
## Proportion Var 0.354
# For correlation
model2 <- principal(data_scaled, nfactors = 1, rotate = 'promax')
# Correlation for promax
cor(model2$scores)
## PC1
## PC1 1
# 2 factor:
fa_result <- factanal(data_scaled, factors = 2, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy
## 0.318 0.766 0.998 0.143
## instrumentalness liveness loudness speechiness
## 0.741 0.634 0.188 0.126
## tempo valence
## 0.908 0.751
##
## Loadings:
## Factor1 Factor2
## acousticness -0.805
## energy 0.910
## loudness 0.901
## liveness 0.600
## speechiness 0.935
## danceability 0.469
## duration_ms
## instrumentalness -0.468
## tempo
## valence 0.498
##
## Factor1 Factor2
## SS loadings 3.062 1.366
## Proportion Var 0.306 0.137
## Cumulative Var 0.306 0.443
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 114774.6 on 26 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled, nfactors=2, rotate="varimax")
print(pca_result$loadings, cutoff=0.35, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC2
## acousticness -0.795
## danceability 0.657
## energy 0.860
## instrumentalness -0.600
## loudness 0.886
## valence 0.671
## liveness 0.835
## speechiness 0.876
## duration_ms
## tempo
##
## RC1 RC2
## SS loadings 3.534 1.702
## Proportion Var 0.353 0.170
## Cumulative Var 0.353 0.524
# For correlation
model2 <- principal(data_scaled, nfactors = 2, rotate = 'promax')
# Correlation for promax
cor(model2$scores)
## RC1 RC2
## RC1 1.00000000 0.06440049
## RC2 0.06440049 1.00000000
# 3 factor:
fa_result <- factanal(data_scaled, factors = 3, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled, factors = 3, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy
## 0.313 0.005 0.984 0.048
## instrumentalness liveness loudness speechiness
## 0.756 0.570 0.234 0.196
## tempo valence
## 0.904 0.587
##
## Loadings:
## Factor1 Factor2 Factor3
## acousticness -0.783
## energy 0.921
## loudness 0.819 0.304
## danceability 0.977
## valence 0.369 0.526
## liveness 0.651
## speechiness 0.881
## duration_ms
## instrumentalness -0.333 -0.306
## tempo
##
## Factor1 Factor2 Factor3
## SS loadings 2.520 1.540 1.345
## Proportion Var 0.252 0.154 0.135
## Cumulative Var 0.252 0.406 0.540
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 35176.84 on 18 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled, nfactors=3, rotate="varimax")
print(pca_result$loadings, cutoff=0.4, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3
## acousticness -0.846
## energy 0.860
## loudness 0.858
## liveness 0.858
## speechiness 0.868
## danceability 0.729
## duration_ms -0.640
## valence 0.682
## instrumentalness -0.442
## tempo 0.479
##
## RC1 RC2 RC3
## SS loadings 2.940 1.730 1.719
## Proportion Var 0.294 0.173 0.172
## Cumulative Var 0.294 0.467 0.639
# For correlation
model2 <- principal(data_scaled, nfactors = 3, rotate = 'promax')
print(model2$loadings, cutoff=0.3, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3
## acousticness -0.852
## energy 0.866
## loudness 0.846
## tempo 0.520
## liveness 0.870
## speechiness 0.870
## danceability 0.709
## duration_ms 0.340 -0.713
## valence 0.655
## instrumentalness -0.395 -0.315
##
## RC1 RC2 RC3
## SS loadings 2.876 1.724 1.634
## Proportion Var 0.288 0.172 0.163
## Cumulative Var 0.288 0.460 0.623
# Correlation for promax
cor(model2$scores)
## RC1 RC2 RC3
## RC1 1.00000000 0.07074425 0.3302208
## RC2 0.07074425 1.00000000 0.1459794
## RC3 0.33022081 0.14597937 1.0000000
# 4 factor:
fa_result <- factanal(data_scaled, factors = 4, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled, factors = 4, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy
## 0.331 0.005 0.984 0.005
## instrumentalness liveness loudness speechiness
## 0.676 0.563 0.005 0.215
## tempo valence
## 0.907 0.579
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## acousticness -0.763
## energy 0.931
## loudness 0.778 0.571
## danceability 0.970
## valence 0.357 0.541
## liveness 0.656
## speechiness 0.867
## duration_ms
## instrumentalness -0.372
## tempo
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.383 1.504 1.332 0.510
## Proportion Var 0.238 0.150 0.133 0.051
## Cumulative Var 0.238 0.389 0.522 0.573
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 12873.63 on 11 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled, nfactors=4, rotate="varimax")
print(pca_result$loadings, cutoff=0.4, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC4 RC3
## acousticness -0.648 -0.507
## danceability 0.818
## energy 0.701 0.510
## instrumentalness -0.604
## loudness 0.762 0.456
## valence 0.755
## liveness 0.877
## speechiness 0.860
## tempo 0.816
## duration_ms 0.909
##
## RC1 RC2 RC4 RC3
## SS loadings 3.112 1.674 1.477 1.063
## Proportion Var 0.311 0.167 0.148 0.106
## Cumulative Var 0.311 0.479 0.626 0.733
# For correlation
model2 <- principal(data_scaled, nfactors = 4, rotate = 'promax')
# Correlation for promax
cor(model2$scores)
## RC1 RC2 RC4 RC3
## RC1 1.00000000 0.142743803 0.334277915 0.08031949
## RC2 0.14274380 1.000000000 0.004226895 0.04641827
## RC4 0.33427791 0.004226895 1.000000000 0.36058434
## RC3 0.08031949 0.046418271 0.360584342 1.00000000
# 5 factor:
fa_result <- factanal(data_scaled, factors = 5, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled, factors = 5, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy
## 0.315 0.005 0.973 0.005
## instrumentalness liveness loudness speechiness
## 0.179 0.556 0.212 0.238
## tempo valence
## 0.898 0.005
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4 Factor5
## acousticness -0.786
## energy 0.944
## loudness 0.812
## liveness 0.659
## speechiness 0.855
## valence 0.333 0.937
## danceability 0.479 0.830
## instrumentalness 0.811
## duration_ms
## tempo
##
## Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings 2.508 1.307 1.219 0.810 0.771
## Proportion Var 0.251 0.131 0.122 0.081 0.077
## Cumulative Var 0.251 0.381 0.503 0.584 0.661
##
## Test of the hypothesis that 5 factors are sufficient.
## The chi square statistic is 1620.39 on 5 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled, nfactors=5, rotate="varimax")
print(pca_result$loadings, cutoff=0.4, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC5 RC2 RC4 RC3
## acousticness -0.867
## energy 0.886
## loudness 0.864
## danceability 0.860
## valence 0.811
## liveness 0.871
## speechiness 0.869
## tempo 0.978
## duration_ms 0.995
## instrumentalness -0.472
##
## RC1 RC5 RC2 RC4 RC3
## SS loadings 2.583 1.837 1.679 1.010 1.000
## Proportion Var 0.258 0.184 0.168 0.101 0.100
## Cumulative Var 0.258 0.442 0.610 0.711 0.811
# For correlation
model2 <- principal(data_scaled, nfactors = 5, rotate = 'promax')
# Correlation for promax
cor(model2$scores)
## RC1 RC5 RC2 RC4 RC3
## RC1 1.0000000 0.5272046 0.116826597 0.28429590 -0.032885502
## RC5 0.5272046 1.0000000 0.158801679 0.13602827 -0.153934144
## RC2 0.1168266 0.1588017 1.000000000 -0.06537388 -0.007637112
## RC4 0.2842959 0.1360283 -0.065373881 1.00000000 -0.040751685
## RC3 -0.0328855 -0.1539341 -0.007637112 -0.04075169 1.000000000
# Conduct the MAP test via the VSS function
vss_result <- VSS(data_scaled, n = 8, rotate = "varimax", fm = "ml")
print(vss_result)
##
## Very Simple Structure
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
## n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.73 with 2 factors
## VSS complexity 2 achieves a maximimum of 0.8 with 3 factors
##
## The Velicer MAP achieves a minimum of 0.05 with 1 factors
## BIC achieves a minimum of 1559.97 with 5 factors
## Sample Size adjusted BIC achieves a minimum of 1575.86 with 5 factors
##
## Statistics by number of factors
## vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC complex
## 1 0.63 0.00 0.053 35 2.3e+05 0 7.1 0.63 0.193 229324 229436 1.0
## 2 0.73 0.76 0.062 26 1.1e+05 0 4.6 0.76 0.158 114460 114543 1.1
## 3 0.64 0.80 0.067 18 3.5e+04 0 3.3 0.83 0.105 34959 35017 1.4
## 4 0.59 0.76 0.124 11 1.3e+04 0 3.1 0.84 0.081 12741 12775 1.5
## 5 0.61 0.79 0.173 5 1.6e+03 0 2.3 0.88 0.043 1560 1576 1.4
## 6 0.62 0.79 0.251 0 3.0e+02 NA 2.1 0.89 NA NA NA 1.4
## 7 0.57 0.74 0.387 -4 2.2e-05 NA 2.4 0.87 NA NA NA 1.8
## 8 0.60 0.75 0.551 -7 1.6e-09 NA 2.2 0.88 NA NA NA 1.7
## eChisq SRMR eCRMS eBIC
## 1 2.4e+05 1.2e-01 0.140 241945
## 2 8.2e+04 7.2e-02 0.094 81682
## 3 1.4e+04 2.9e-02 0.047 13560
## 4 9.0e+03 2.4e-02 0.048 8891
## 5 9.1e+02 7.5e-03 0.023 845
## 6 3.2e+02 4.5e-03 NA NA
## 7 1.9e-05 1.1e-06 NA NA
## 8 2.2e-09 1.2e-08 NA NA
summary(vss_result)
##
## Very Simple Structure
## VSS complexity 1 achieves a maximimum of 0.73 with 2 factors
## VSS complexity 2 achieves a maximimum of 0.8 with 3 factors
##
## The Velicer MAP criterion achieves a minimum of 0.55 with 1 factors
##
# Drop instrumentalness (column 5), which cross-loads weakly, and re-fit the 3-factor model
fa_result <- factanal(data_scaled[, -c(5)], factors = 3, rotation = "varimax")
print(fa_result, cutoff=0.3, digits=3, sort=TRUE)
##
## Call:
## factanal(x = data_scaled[, -c(5)], factors = 3, rotation = "varimax")
##
## Uniquenesses:
## acousticness danceability duration_ms energy liveness loudness
## 0.326 0.005 0.984 0.005 0.561 0.258
## speechiness tempo valence
## 0.233 0.907 0.584
##
## Loadings:
## Factor1 Factor2 Factor3
## acousticness -0.775
## energy 0.941
## loudness 0.803 0.304
## danceability 0.976
## valence 0.375 0.524
## liveness 0.658
## speechiness 0.861
## duration_ms
## tempo
##
## Factor1 Factor2 Factor3
## SS loadings 2.409 1.438 1.290
## Proportion Var 0.268 0.160 0.143
## Cumulative Var 0.268 0.427 0.571
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 10931.86 on 12 degrees of freedom.
## The p-value is 0
pca_result <- principal(data_scaled[, -c(5)], nfactors=3, rotate="varimax")
print(pca_result$loadings, cutoff=0.4, digits=3, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3
## acousticness -0.854
## energy 0.871
## loudness 0.859
## liveness 0.869
## speechiness 0.874
## danceability 0.727
## duration_ms -0.655
## valence 0.407 0.687
## tempo 0.476
##
## RC1 RC2 RC3
## SS loadings 2.795 1.660 1.584
## Proportion Var 0.311 0.184 0.176
## Cumulative Var 0.311 0.495 0.671
# For correlation
model2 <- principal(data_scaled[, -c(5)], nfactors = 3, rotate = 'promax')
# Correlation for promax
cor(model2$scores)
## RC1 RC2 RC3
## RC1 1.00000000 -0.02459497 0.25390044
## RC2 -0.02459497 1.00000000 0.05871346
## RC3 0.25390044 0.05871346 1.00000000
### Why We Chose Three Components

We ran PCA with both varimax and promax rotations and saw that the first three components stood out cleanly. The promax correlations among those three were modest (the largest, between RC1 and RC3, was about 0.33), so the components are close to independent. When we tried a fourth component, it mostly picked up single-variable “stubs” and added less than 11% to the explained variance, so we kept three. The VSS and MAP tests suggest a dimensionality between 1 and 3 factors, with VSS peaking at 3 factors and MAP favoring 1, while BIC suggests 5 factors. The 3-component PCA solution aligns well with the VSS guidance and captures the key structure while balancing simplicity and interpretability.
This PCA reduces 10 Spotify audio features to three
interpretable axes that explain approximately 63% of the total
variance.
Each component represents a distinct sonic dimension that maps naturally
onto real-world curation and recommendation strategies.
This dimensionality reduction transforms Spotify’s abstract audio feature space into a business-ready, human-interpretable framework for playlist curation, editorial tooling, and consumer-facing features.
Next I will plot the principal components against each other to see whether any obvious shapes or clusters emerge; this will guide my choice of clustering method.
# Plot the first principal component against the second
plot(pca_result$scores[,1], pca_result$scores[,2],
xlab="First Principal Component",
ylab="Second Principal Component",
main="PCA: Component 1 vs Component 2")
plot(pca_result$scores[,1], pca_result$scores[,3],
xlab="First Principal Component",
ylab="Third Principal Component",
main="PCA: Component 1 vs Component 3")
plot(pca_result$scores[,2], pca_result$scores[,3],
xlab="Second Principal Component",
ylab="Third Principal Component",
main="PCA: Component 2 vs Component 3")
par(mar=c(8, 8, 8, 8) , cex.main=2, cex.lab=1.5, cex.axis=1.5)
mclust_result <- Mclust(data_scaled)
plot(mclust_result, what = "BIC")
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log-likelihood n df BIC ICL
## -1273096 176774 521 -2552487 -2591162
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 37059 27493 15741 20509 35869 9234 5437 8927 16505
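For context on how Mclust arrived at VEV with 9 components: it fits a family of covariance parameterizations over a range of component counts and keeps the combination with the best BIC. A minimal sketch on a built-in dataset (the `G` range here is illustrative, not the one used above):

```r
library(mclust)
# Mclust scans covariance parameterizations (EII, ..., VEV, VVV) over a
# range of component counts G and selects the combination with the best BIC.
fit <- Mclust(iris[, 1:4], G = 1:5)
fit$modelName             # selected covariance structure
fit$G                     # selected number of components
table(fit$classification) # cluster sizes
```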
## Cluster cohesion & separation
# 1. Attach cluster labels
df$cluster <- mclust_result$classification
# 2. Track counts per cluster
df %>% count(cluster, name = "tracks") %>%
knitr::kable(
caption = "Number of tracks in each cluster"
)
| cluster | tracks |
|---|---|
| 1 | 37059 |
| 2 | 27493 |
| 3 | 15741 |
| 4 | 20509 |
| 5 | 35869 |
| 6 | 9234 |
| 7 | 5437 |
| 8 | 8927 |
| 9 | 16505 |
# 3. Compute overall variance (mean squared distance to global centroid)
# sweep() subtracts the column means row-wise; data_scaled is already
# centred (means ~0), but this is the correct general pattern.
overall_var <- mean(
  rowSums(sweep(data_scaled, 2, colMeans(data_scaled))^2)
)
# 4. Compute within-cluster variance
# We pull each cluster’s rows into a matrix, compute its centroid,
# then average squared distances back to that centroid.
data_clu <- as_tibble(data_scaled) %>% mutate(cluster = df$cluster)
within_var <- data_clu %>%
  group_by(cluster) %>%
  summarise(
    within_var = {
      # pick() returns the non-grouping columns for the current group
      # (it supersedes cur_data() in dplyr >= 1.1.0)
      m <- as.matrix(pick(everything()))
      cent <- colMeans(m)
      mean(rowSums((m - cent)^2))
    },
    .groups = "drop"
  ) %>%
  mutate(ratio_vs_overall = round(within_var / overall_var, 2))
# 5. Display tidy table
within_var %>%
knitr::kable(
col.names = c("Cluster", "Within-Cluster Var", "Ratio vs Overall"),
caption = "Within-cluster variance compared to overall variance"
)
| Cluster | Within-Cluster Var | Ratio vs Overall |
|---|---|---|
| 1 | 4.963681 | 0.50 |
| 2 | 9.872430 | 0.99 |
| 3 | 31.416682 | 3.14 |
| 4 | 9.634426 | 0.96 |
| 5 | 8.177021 | 0.82 |
| 6 | 34.693353 | 3.47 |
| 7 | 24.232146 | 2.42 |
| 8 | 15.124148 | 1.51 |
| 9 | 7.222772 | 0.72 |
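The ratio diagnostic above is general: a cluster tighter than the pooled data scores below 1. A self-contained sketch on simulated 2-D data (hypothetical labels, not the Spotify clusters):

```r
# Within-cluster variance vs. overall variance on simulated data:
# two tight blobs far apart, so each cluster's ratio falls well below 1.
set.seed(42)
x  <- rbind(matrix(rnorm(200, mean = 0, sd = 0.2), ncol = 2),
            matrix(rnorm(200, mean = 5, sd = 0.2), ncol = 2))
cl <- rep(1:2, each = 100)
overall_var <- mean(rowSums(sweep(x, 2, colMeans(x))^2))
within_var  <- sapply(split(seq_len(nrow(x)), cl), function(idx) {
  m <- x[idx, , drop = FALSE]
  mean(rowSums(sweep(m, 2, colMeans(m))^2))
})
round(within_var / overall_var, 3)   # both ratios far below 1
```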
# Posterior membership probabilities (mclust_result$z) for the first three
# clusters; note these are probabilities, not PC scores.
pairs(as.data.frame(mclust_result$z)[, 1:3],
      pch = 20, col = df$cluster,
      main = "Posterior probabilities for clusters 1-3")
par(mar=c(8, 8, 8, 8) , cex.main=2, cex.lab=1.5, cex.axis=1.5)
plot(mclust_result, what = "classification")
# ── Cluster means and standard deviations ───────────────────────────────
classification <- mclust_result$classification
# Compute per-cluster means and standard deviations
cluster_means <- aggregate(data_scaled,
by = list(cluster = classification),
FUN = mean)
cluster_std <- aggregate(data_scaled,
by = list(cluster = classification),
FUN = sd)
# Display results with clear headings
cat("=== Cluster Means (standardized features) ===\n")
## === Cluster Means (standardized features) ===
print(round(cluster_means, 3))
## cluster acousticness danceability duration_ms energy instrumentalness
## 1 1 0.184 0.260 -0.088 -0.203 -0.528
## 2 2 -0.132 -0.136 0.113 -0.072 1.117
## 3 3 1.420 -1.414 0.224 -1.540 1.990
## 4 4 -0.054 0.144 0.008 0.156 -0.521
## 5 5 -1.000 0.179 -0.132 0.881 -0.529
## 6 6 1.108 0.113 -0.104 0.426 -0.533
## 7 7 -0.255 0.127 0.924 0.259 0.927
## 8 8 1.401 -1.127 0.022 -1.419 -0.465
## 9 9 -0.601 0.929 -0.186 0.379 -0.531
## liveness loudness speechiness tempo valence
## 1 -0.389 0.160 -0.390 0.044 0.247
## 2 -0.212 -0.173 -0.397 0.044 -0.102
## 3 -0.514 -1.826 -0.425 -0.530 -1.201
## 4 1.066 0.149 0.256 0.107 0.190
## 5 -0.145 0.759 -0.302 0.276 0.380
## 6 2.441 -0.299 3.809 -0.629 -0.178
## 7 0.331 0.114 0.329 0.185 -0.039
## 8 -0.349 -1.129 -0.426 -0.409 -1.016
## 9 -0.578 0.576 0.272 0.112 0.360
cat("\n=== Cluster Standard Deviations (standardized features) ===\n")
##
## === Cluster Standard Deviations (standardized features) ===
print(round(cluster_std, 3))
## cluster acousticness danceability duration_ms energy instrumentalness
## 1 1 0.743 0.792 0.529 0.689 0.014
## 2 2 0.945 1.000 0.911 0.931 1.043
## 3 3 0.218 0.737 1.353 0.410 0.547
## 4 4 0.920 0.902 0.761 0.927 0.034
## 5 5 0.117 0.750 0.355 0.466 0.011
## 6 6 0.253 0.457 1.135 0.791 0.005
## 7 7 0.976 1.032 3.644 1.011 1.008
## 8 8 0.210 0.663 0.774 0.387 0.123
## 9 9 0.387 0.657 0.305 0.546 0.008
## liveness loudness speechiness tempo valence
## 1 0.342 0.504 0.115 0.922 0.974
## 2 0.664 0.860 0.076 0.934 1.045
## 3 0.222 1.038 0.032 0.959 0.466
## 4 1.246 0.831 0.676 1.073 0.906
## 5 0.556 0.280 0.188 0.937 0.822
## 6 0.970 0.782 0.347 0.878 0.752
## 7 1.270 0.856 0.864 1.094 0.934
## 8 0.412 0.974 0.039 0.954 0.462
## 9 0.156 0.343 0.574 1.010 0.868
knitr::kable(round(cluster_means, 2), caption = "Cluster Means (z-scores)")
| cluster | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.18 | 0.26 | -0.09 | -0.20 | -0.53 | -0.39 | 0.16 | -0.39 | 0.04 | 0.25 |
| 2 | -0.13 | -0.14 | 0.11 | -0.07 | 1.12 | -0.21 | -0.17 | -0.40 | 0.04 | -0.10 |
| 3 | 1.42 | -1.41 | 0.22 | -1.54 | 1.99 | -0.51 | -1.83 | -0.42 | -0.53 | -1.20 |
| 4 | -0.05 | 0.14 | 0.01 | 0.16 | -0.52 | 1.07 | 0.15 | 0.26 | 0.11 | 0.19 |
| 5 | -1.00 | 0.18 | -0.13 | 0.88 | -0.53 | -0.15 | 0.76 | -0.30 | 0.28 | 0.38 |
| 6 | 1.11 | 0.11 | -0.10 | 0.43 | -0.53 | 2.44 | -0.30 | 3.81 | -0.63 | -0.18 |
| 7 | -0.25 | 0.13 | 0.92 | 0.26 | 0.93 | 0.33 | 0.11 | 0.33 | 0.19 | -0.04 |
| 8 | 1.40 | -1.13 | 0.02 | -1.42 | -0.47 | -0.35 | -1.13 | -0.43 | -0.41 | -1.02 |
| 9 | -0.60 | 0.93 | -0.19 | 0.38 | -0.53 | -0.58 | 0.58 | 0.27 | 0.11 | 0.36 |
knitr::kable(round(cluster_std, 2), caption = "Cluster Std")
| cluster | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.74 | 0.79 | 0.53 | 0.69 | 0.01 | 0.34 | 0.50 | 0.11 | 0.92 | 0.97 |
| 2 | 0.94 | 1.00 | 0.91 | 0.93 | 1.04 | 0.66 | 0.86 | 0.08 | 0.93 | 1.05 |
| 3 | 0.22 | 0.74 | 1.35 | 0.41 | 0.55 | 0.22 | 1.04 | 0.03 | 0.96 | 0.47 |
| 4 | 0.92 | 0.90 | 0.76 | 0.93 | 0.03 | 1.25 | 0.83 | 0.68 | 1.07 | 0.91 |
| 5 | 0.12 | 0.75 | 0.36 | 0.47 | 0.01 | 0.56 | 0.28 | 0.19 | 0.94 | 0.82 |
| 6 | 0.25 | 0.46 | 1.13 | 0.79 | 0.00 | 0.97 | 0.78 | 0.35 | 0.88 | 0.75 |
| 7 | 0.98 | 1.03 | 3.64 | 1.01 | 1.01 | 1.27 | 0.86 | 0.86 | 1.09 | 0.93 |
| 8 | 0.21 | 0.66 | 0.77 | 0.39 | 0.12 | 0.41 | 0.97 | 0.04 | 0.95 | 0.46 |
| 9 | 0.39 | 0.66 | 0.30 | 0.55 | 0.01 | 0.16 | 0.34 | 0.57 | 1.01 | 0.87 |
df$cluster <- mclust_result$classification
df %>%
count(cluster, genre) %>%
group_by(cluster) %>% slice_max(n, n = 5) %>% ungroup() %>%
ggplot(aes(reorder_within(genre, n, cluster), n, fill = factor(cluster)))+
geom_col(show.legend = FALSE)+coord_flip()+
facet_wrap(~cluster, scales = "free_y")+
scale_x_reordered()+theme_minimal()+   # strips reorder_within() suffixes from labels
labs(title = "Top-5 Genres per Cluster", x = NULL, y = "Count")
df %>%
count(cluster, artist_name) %>%
group_by(cluster) %>% slice_max(n, n = 6) %>% ungroup() %>%
ggplot(aes(reorder_within(artist_name, n, cluster), n, fill = factor(cluster)))+
geom_col(show.legend = FALSE)+coord_flip()+
facet_wrap(~cluster, scales = "free_y")+
scale_x_reordered()+theme_minimal()+   # strips reorder_within() suffixes from labels
labs(title = "Top-6 Most Frequent Artists per Cluster", x = NULL, y = "Count")
# ── SAVE PRE-COMPUTED ARTEFACTS FOR SHINY ──────────────────────────────
# This chunk will (re)compute PCA + clustering if needed, then persist:
# • pca_result (3‐factor PCA)
# • mclust_result (GMM with 9 clusters)
# • scores_df (PC scores + track_id + cluster)
# • df_with_cluster (original df + cluster label)
dir.create(file.path(root, "models"), showWarnings = FALSE)
# --- 1. Ensure PCA + clustering exist (uses data_scaled + df from earlier) ---
if (!exists("pca_result")) {
pca_result <- principal(data_scaled, nfactors = 3, rotate = "varimax")
}
if (!exists("mclust_result")) {
# force 9 clusters as you validated
mclust_result <- Mclust(data_scaled, G = 9)
}
# --- 2. Build scores_df ------------------------------------------------------
scores_df <- as_tibble(pca_result$scores, .name_repair = "unique") |>
set_names(c("PC1", "PC2", "PC3")) |>
mutate(
track_id = df$track_id,
cluster = factor(mclust_result$classification)
)
# --- 3. Persist artefacts ----------------------------------------------------
saveRDS(pca_result, file.path(root, "models/pca_result.rds"))
saveRDS(mclust_result, file.path(root, "models/mclust_result.rds"))
saveRDS(scores_df, file.path(root, "models/scores_df.rds"))
saveRDS(df, file.path(root, "models/df_with_cluster.rds"))
message("✅ Artefacts written to ", file.path(root, "models"))
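On the Shiny side, the app only needs `readRDS()` on those paths. A minimal sketch of the persistence round-trip, using a temp file so it runs anywhere (the `models/` paths above are the real locations):

```r
# saveRDS()/readRDS() round-trip: the pattern the Shiny app uses at startup
# to load pre-computed artefacts (shown here with a temp file and toy data).
artefact <- list(scores  = data.frame(PC1 = rnorm(5), PC2 = rnorm(5)),
                 cluster = factor(c(1, 1, 2, 2, 3)))
path <- tempfile(fileext = ".rds")
saveRDS(artefact, path)
restored <- readRDS(path)      # e.g. scores_df <- readRDS("models/scores_df.rds")
identical(artefact, restored)  # TRUE: the object survives serialisation intact
```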
# ── Top-10 tracks *within each cluster* by popularity ──────────────────
top10_by_cluster <- df %>%
group_by(cluster) %>%
slice_max(order_by = popularity, n = 10, with_ties = FALSE) %>%
ungroup() %>%
arrange(cluster, desc(popularity)) %>%
select(cluster,
artist = artist_name,
track = track_name,
popularity)
print(top10_by_cluster, n = nrow(top10_by_cluster))
## # A tibble: 90 × 4
## cluster artist track popularity
## <dbl> <chr> <chr> <int>
## 1 1 Post Malone "Sunflower - Spider-Man: Into the… 97
## 2 1 Sam Smith "Dancing With A Stranger (with No… 97
## 3 1 Marshmello "Happier" 97
## 4 1 Pedro Capó "Calma - Remix" 97
## 5 1 Anuel Aa "Secreto" 96
## 6 1 Lady Gaga "Shallow" 95
## 7 1 Meek Mill "Going Bad (feat. Drake)" 95
## 8 1 Post Malone "Better Now" 93
## 9 1 Ariana Grande "needy" 92
## 10 1 Alec Benjamin "Let Me Down Slowly" 92
## 11 2 Ariana Grande "bad idea" 91
## 12 2 Imagine Dragons "Thunder" 88
## 13 2 John Mayer "New Light" 84
## 14 2 Foster The People "Pumped Up Kicks" 83
## 15 2 benny blanco "I Found You (with Calvin Harris)" 82
## 16 2 Drake "Passionfruit" 82
## 17 2 Twenty One Pilots "Chlorine" 82
## 18 2 Calvin Harris "This Is What You Came For" 81
## 19 2 Billie Eilish "WHEN I WAS OLDER - Music Inspire… 81
## 20 2 Guns N' Roses "November Rain" 81
## 21 3 Aphex Twin "Avril 14th" 72
## 22 3 Max Richter "A Catalogue Of Afternoons" 72
## 23 3 Bon Iver "Holocene" 71
## 24 3 Peter Bradley Adams "Interlude for Piano" 71
## 25 3 Harold Budd "Olancha Farewell" 69
## 26 3 Brian McBride "Overture (For Other Halfs)" 69
## 27 3 Goldmund "Threnody" 69
## 28 3 Novo Amor "Anchor" 68
## 29 3 Radical Face "Welcome Home, Son" 68
## 30 3 Wolfgang Amadeus Mozart "Piano Concerto No. 21 in C Major… 68
## 31 4 Ariana Grande "7 rings" 100
## 32 4 J. Cole "MIDDLE CHILD" 96
## 33 4 Lauv "i'm so tired..." 93
## 34 4 Lil Baby "Drip Too Hard (Lil Baby & Gunna)" 92
## 35 4 Juice WRLD "Lucid Dreams" 92
## 36 4 Ariana Grande "NASA" 91
## 37 4 Mustard "Pure Water (with Migos)" 91
## 38 4 Ariana Grande "fake smile" 90
## 39 4 XXXTENTACION "Jocelyn Flores" 90
## 40 4 Drake "God's Plan" 89
## 41 5 Daddy Yankee "Con Calma" 98
## 42 5 Ava Max "Sweet but Psycho" 97
## 43 5 Ozuna "Baila Baila Baila" 95
## 44 5 Calvin Harris "Giant (with Rag'n'Bone Man)" 94
## 45 5 21 Savage "a lot" 93
## 46 5 Mau y Ricky "Desconocidos" 93
## 47 5 Mark Ronson "Nothing Breaks Like a Heart (fea… 92
## 48 5 Ellie Goulding "Close To Me (with Diplo) (feat. … 92
## 49 5 Ariana Grande "bloodline" 91
## 50 5 Jonas Brothers "Sucker" 91
## 51 6 Joyner Lucas "I'm Not Racist" 66
## 52 6 Drake "The Calm" 64
## 53 6 Kendrick Lamar "Mortal Man" 55
## 54 6 Frank Ocean "Facebook Story" 54
## 55 6 Eminem "Public Service Announcement" 54
## 56 6 JID "Frequency Change" 53
## 57 6 Duckwrth "I'M DEAD" 52
## 58 6 J. Cole "Kerney Sermon (Skit)" 52
## 59 6 Eminem "Steve Berman" 52
## 60 6 Hobo Johnson "3%" 52
## 61 7 Khalid "Better" 88
## 62 7 Billie Eilish "bellyache" 87
## 63 7 A$AP Rocky "Praise The Lord (Da Shine) (feat… 86
## 64 7 Billie Eilish "you should see me in a crown" 85
## 65 7 Led Zeppelin "Whole Lotta Love - 1990 Remaster" 77
## 66 7 Joji "YEAH RIGHT" 76
## 67 7 Elton John "I'm Still Standing" 76
## 68 7 Joji "CAN'T GET OVER YOU (feat. Clams … 74
## 69 7 Billie Eilish "Bellyache - Marian Hill Remix" 74
## 70 7 Twenty One Pilots "Levitate" 74
## 71 8 Billie Eilish "ocean eyes" 86
## 72 8 John Legend "All of Me" 85
## 73 8 John Mayer "I Guess I Just Feel Like" 80
## 74 8 John Lennon "Imagine - Remastered" 78
## 75 8 Ben&Ben "Kathang Isip" 78
## 76 8 Fleetwood Mac "Landslide" 78
## 77 8 Kacey Musgraves "Rainbow" 78
## 78 8 Kina Grannis "Can’t Help Falling in Love" 77
## 79 8 Ruben "Lay By Me" 77
## 80 8 The Beatles "Yesterday - Remastered 2009" 75
## 81 9 Ariana Grande "break up with your girlfriend, i… 99
## 82 9 Post Malone "Wow." 99
## 83 9 Halsey "Without Me" 97
## 84 9 DJ Snake "Taki Taki (with Selena Gomez, Oz… 96
## 85 9 Ariana Grande "thank u, next" 95
## 86 9 Paulo Londra "Adan y Eva" 95
## 87 9 Panic! At The Disco "High Hopes" 95
## 88 9 Bad Bunny "MIA (feat. Drake)" 95
## 89 9 Travis Scott "SICKO MODE" 94
## 90 9 Khalid "Talk" 94
ggplot(df, aes(factor(cluster), popularity, fill = factor(cluster)))+
geom_violin(trim = FALSE)+geom_boxplot(width = .1, outlier.shape = NA)+
theme_minimal()+labs(title = "Popularity by Cluster", x = "Cluster")
We clustered approximately 177,000 Spotify tracks using a Gaussian
Mixture Model (Mclust VEV, 9 clusters).
The goal: interpret each cluster’s sonic and genre profile, evaluate
coherence (via within-cluster variance), and identify actionable
business use cases.
Scope:
This diagnostic answers questions about cluster quality,
interpretability, and immediate business potential.
Future refinement opportunities are explicitly flagged as “Future
analysis”.
| Cluster | # Tracks | Variance Ratio vs. Overall | Interpretation |
|---|---|---|---|
| 1 | 37,059 | 0.50 | Tight, well-defined |
| 2 | 27,493 | 0.99 | Acceptable |
| 3 | 15,741 | 3.14 | High variance, poor cohesion 🔔 |
| 4 | 20,509 | 0.96 | Acceptable |
| 5 | 35,869 | 0.82 | Good homogeneity |
| 6 | 9,234 | 3.47 | Very high variance, poor cohesion 🔔 |
| 7 | 5,437 | 2.42 | High variance, weak cohesion 🔔 |
| 8 | 8,927 | 1.51 | Moderate dispersion |
| 9 | 16,505 | 0.72 | Good homogeneity |
🔔 Interpretive caution required for Clusters 3, 6, 7 due to poor within-cluster cohesion.
Below we summarise what each interactive view in the Shiny app dashboard reveals about the nine sonic clusters and how a product manager can act on those findings.
Top-5 genres per cluster bar-matrix
Implication: energy-driven clusters attract larger audiences; speech-dominant content is niche but valuable for ad/podcast placement.